Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data
Authors
Abstract
The presence of missing values in a dataset can adversely affect the performance of a classifier; performance deteriorates rapidly as missingness increases. Single imputation and Multiple Imputation (MI) are normally performed to fill in the missing values. In this paper, we present several variants of combining MI and bootstrapping to create ensembles that can model uncertainty and diversity in the data and that are robust to high missingness. We present three ensemble strategies: bootstrapping on incomplete data followed by single imputation, bootstrapping on incomplete data followed by MI, and an MI ensemble without bootstrapping. We use mean imputation, Gaussian random imputation and expectation maximization as the base imputation methods in these ensemble strategies. We perform an extensive evaluation of the proposed ensemble strategies on 8 datasets by varying the missingness ratio. Our results show that bootstrapping followed by the average of MIs using expectation maximization is the most robust method and prevents the classifier's performance from degrading, even at a high missingness ratio (30%). For small missingness ratios (up to 10%), most of the ensemble methods perform equivalently to one another but better than their single imputation counterparts. Kappa-error plots suggest that accurate classifiers with reasonable diversity are the reason for this behaviour. A consistent observation across all the datasets is that, for small missingness (up to 10%), bootstrapping on incomplete data without any imputation produces results equivalent to those of the other ensemble methods with imputation.
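The bootstrapping-then-imputation strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: missing values are encoded as `None`, Gaussian random imputation stands in for the MI step (expectation maximization is not implemented here), "average of MIs" is read as averaging the imputed values of several draws, the base classifier is a toy nearest-centroid model, and query points are completed with mean imputation. All function names are hypothetical.

```python
import random
import statistics

def bootstrap(rows, rng):
    """Draw len(rows) rows with replacement from the incomplete data."""
    return [rng.choice(rows) for _ in rows]

def column_stats(rows, n_features):
    """Per-feature (mean, std) computed from the observed (non-None) values."""
    stats = []
    for j in range(n_features):
        observed = [x[j] for x, _ in rows if x[j] is not None]
        mean = statistics.fmean(observed) if observed else 0.0
        std = statistics.pstdev(observed) if len(observed) > 1 else 0.0
        stats.append((mean, std))
    return stats

def impute(x, stats, rng=None):
    """Mean imputation when rng is None, otherwise Gaussian random imputation."""
    return [
        (m if rng is None else rng.gauss(m, s)) if v is None else v
        for v, (m, s) in zip(x, stats)
    ]

def fit_centroids(rows):
    """Toy base classifier (an assumption): one centroid per class label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: [statistics.fmean(col) for col in zip(*xs)]
            for y, xs in by_label.items()}

def predict_centroid(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def train_ensemble(rows, n_features, n_members=15, n_imputations=1, seed=0):
    """One member per bootstrap sample; impute each sample after resampling."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        sample = bootstrap(rows, rng)
        stats = column_stats(sample, n_features)
        completed = []
        for x, y in sample:
            if n_imputations == 1:
                # Bootstrapping followed by single (mean) imputation.
                completed.append((impute(x, stats), y))
            else:
                # Bootstrapping followed by the average of several random draws.
                draws = [impute(x, stats, rng) for _ in range(n_imputations)]
                completed.append(([statistics.fmean(c) for c in zip(*draws)], y))
        members.append((stats, fit_centroids(completed)))
    return members

def predict(members, x):
    """Majority vote; query points are mean-imputed with each member's stats."""
    votes = [predict_centroid(c, impute(x, stats)) for stats, c in members]
    return max(set(votes), key=votes.count)

# Toy two-class data with some missing second-feature values.
data = [([i * 0.1, None if i % 3 == 0 else 0.2], "a") for i in range(10)]
data += [([10.0 - i * 0.1, None if i % 4 == 0 else 9.8], "b") for i in range(10)]
single = train_ensemble(data, n_features=2)
averaged = train_ensemble(data, n_features=2, n_imputations=5)
```

Calling `predict(single, [0.0, None])` completes the missing feature with each member's mean before voting. Dropping the bootstrap step and training one member per random draw would give the third strategy, an MI ensemble without bootstrapping.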
Similar resources
Several approaches to handling missing values of quantitative variables and their effect on the results of a clinical trial
Background and Objectives: A major challenge in longitudinal studies is the problem of missing data. Missing data may cause a loss of information, which reduces the accuracy of the estimator and makes the results biased and inaccurate. Therefore, it is necessary to evaluate the missing data mechanism in longitudinal research and to consider thi...
Accuracy evaluation of different statistical and geostatistical censored data imputation approaches (Case study: Sari Gunay gold deposit)
Most of the geochemical datasets include missing data with different portions and this may cause a significant problem in geostatistical modeling or multivariate analysis of the data. Therefore, it is common to impute the missing data in most of geochemical studies. In this study, three approaches called half detection (HD), multiple imputation (MI), and the cosimulation based on Markov model 2...
Selection of Variables that Influence Drug Injection in Prison: Comparison of Methods with Multiple Imputed Data Sets
Background: Prisoners, compared to the general population, are at greater risk of infection. Drug injection is the main route of HIV transmission, in particular in Iran. What would be of interest is to determine variables that govern drug injection among prisoners. However, one of the issues that challenge model building is incomplete national data sets. In this paper, we addressed the process ...
Working Paper 34 UNITED NATIONS ECONOMIC COMMISSION FOR EUROPE CONFERENCE OF EUROPEAN STATISTICIANS
1. Missing data problems are ubiquitous in many fields, including official statistics, where one of the common treatments of missing data is ratio imputation (de Waal et al., 2011; Thompson & Washington, 2012; Office for National Statistics, 2014). On the other hand, multiple imputation has been the recommended practice from statisticians (Rubin, 1987; Little & Rubin, 2002). Among statisticians...
Comparison of imputation variance estimators
Appropriate imputation inference requires both an unbiased imputation estimator and an unbiased variance estimator. The commonly used variance estimator, proposed by Rubin, can be biased when the imputation and analysis models are misspecified and/or incompatible. Robins and Wang proposed an alternative approach, which allows for such misspecification and incompatibility, but it is considerably...
Journal title:
- CoRR
Volume abs/1802.00154, Issue -
Pages -
Publication date 2018